Group 33 - Xinghao Huang - 81848509¶


Section 0: TA feedback¶

  • score: 29/30
  • Feedback on Scientific Question: which input variable(s) will be included for location?
  • Feedback on Question Focus: "understanding the effects of room type, cleanliness rating, and location on customer satisfaction": this implies that customer satisfaction is the response?
  • Feedback on Question Focus: The word "effects" can be misleading. Association vs causation. We usually can't use observational data to make causal inference.

Section 1: Data Description¶

1. Descriptive Summary¶

  • There are 5280 observations in total (2653 in athens_weekdays.csv and 2627 in athens_weekends.csv) and 19 variables. Both datasets have the same variables; the row identifier column (loaded as X in R) is not counted because it is just an identifier rather than a meaningful variable.

  • Variable Summary:

| Variable | Type | Description |
| realSum | Quantitative | total price of the listing in EUR |
| room_type | Categorical/nominal | room type: private room, shared room, or entire home/apt |
| room_shared | Categorical/binary | whether a room is shared |
| room_private | Categorical/binary | whether a room is private |
| person_capacity | Quantitative | number of people a room can accommodate |
| host_is_superhost | Categorical/binary | whether a host is a superhost |
| multi | Categorical/binary | whether the listing is for multiple rooms |
| biz | Categorical/binary | whether an observation is associated with a business |
| cleanliness_rating | Quantitative | rating of cleanliness |
| guest_satisfaction_overall | Quantitative | overall rating from guests comparing all listings offered by the host |
| bedrooms | Quantitative | number of bedrooms |
| dist | Quantitative | distance from the city center (km) |
| metro_dist | Quantitative | distance from the nearest metro station (km) |
| attr_index | Quantitative | attr index |
| attr_index_norm | Quantitative | normalized attr index |
| rest_index | Quantitative | rest index |
| rest_index_norm | Quantitative | normalized rest index |
| lng | Quantitative | longitude coordinate for location identification |
| lat | Quantitative | latitude coordinate for location identification |

2. Source and Information¶

  • The datasets were originally obtained from Gyódi and Nawaro (2021), Determinants of Airbnb Prices in European Cities: A Spatial Econometrics Approach (supplementary material), published on Zenodo.

  • The data were collected from Airbnb listings across multiple European cities, focusing on listing attributes, host information, and spatial factors affecting pricing.

  • This dataset offers a detailed overview of Airbnb prices in Athens, including information on room type, cleanliness and satisfaction ratings, number of bedrooms, distance from the city centre, and other attributes that help explain price differences between weekday and weekend stays.

  • Citation: Gyódi, K., & Nawaro, Ł. (2021, March 25). Determinants of Airbnb prices in European cities: A Spatial Econometrics Approach (supplementary material). Zenodo. https://zenodo.org/records/4446043#.Y9Y9ENJBwUE

3. Preselection of Variables¶

  • room_shared, room_private, and multi have redundant information because we can also acquire the same and even more complete information from room_type and bedrooms.
  • lng and lat will be dropped because they only provide raw spatial coordinates, and the information about distance can be obtained from dist and metro_dist.
  • attr_index, attr_index_norm, rest_index, and rest_index_norm will also be dropped because their definitions and interpretations are unclear from the dataset documentation, and they seem like post-analysis results.
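The redundancy claim in the first bullet can be checked directly before any columns are dropped. The sketch below uses a small made-up data frame (`toy`, not the real data) with the same column names and "True"/"False" string encoding seen in the dataset preview; it is one possible way to run the check and the drop, not necessarily how the preselection was carried out.

```r
# Hypothetical toy data mimicking a few Athens columns (values are made up).
toy <- data.frame(
  room_type    = c("Entire home/apt", "Private room", "Shared room", "Private room"),
  room_shared  = c("False", "False", "True", "False"),
  room_private = c("False", "True", "False", "True"),
  dist         = c(2.8, 0.4, 1.2, 4.4),
  lng          = c(23.77, 23.73, 23.72, 23.73),
  lat          = c(37.98, 37.98, 37.98, 38.01)
)

# room_shared and room_private should be fully determined by room_type.
shared_matches  <- all((toy$room_type == "Shared room")  == (toy$room_shared  == "True"))
private_matches <- all((toy$room_type == "Private room") == (toy$room_private == "True"))

# Drop the redundant and unused columns named in the bullets above.
drop_cols <- c("room_shared", "room_private", "multi", "lng", "lat",
               "attr_index", "attr_index_norm", "rest_index", "rest_index_norm")
toy_reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_reduced)
```

If either `shared_matches` or `private_matches` came back FALSE on the real data, the indicator columns would carry information beyond room_type and dropping them would deserve a second look.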

Section 2: Scientific Question¶

1. State the Question¶

  • (Updated based on TA feedback; cleanliness rating removed): Question: How is the Airbnb price in Athens associated with day type, room type, customer satisfaction, and distance from the city center?
  • Specifically, I want to understand which of these factors has the strongest relationship with the Airbnb price.

2. Name the Response¶

  • The response variable is realSum (the Airbnb price in Athens).

3. Question Focus¶

  • (Updated based on feedback from TA): My question mainly focuses on inference, since it is about understanding the association between the Airbnb price and day type, room type, customer satisfaction, and distance from the city center rather than predicting new outcomes.

Section 3: Exploratory Data Analysis and Visualization¶

1. Reproducible Code¶

In [1]:
# load some libraries
library(ggplot2)
library(dplyr)
library(patchwork)

# I initially used install.packages(...), but it did not work
Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



The two datasets have been uploaded from my local device to the STAT 301 Workspace. The cell below shows how they can be loaded into R.

In [2]:
# reading the file
athens_weekdays <- read.csv("/home/jovyan/work/stat-301/materials/Project/data/athens_weekdays.csv", header = TRUE)
athens_weekends <- read.csv("/home/jovyan/work/stat-301/materials/Project/data/athens_weekends.csv", header = TRUE)

# check whether athens_weekends has any missing values (TRUE means none)
sum(is.na(athens_weekends)) == 0
TRUE
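The check above covers only athens_weekends. A sketch of the same check applied to both data frames, column by column, is given below. It assumes missing values are encoded as NA, and the two toy data frames (`weekdays_toy`, `weekends_toy`) are made-up stand-ins for the real data.

```r
# Hypothetical stand-ins for the two data frames (values are made up).
weekdays_toy <- data.frame(realSum = c(129.8, 139.0), dist = c(2.81, 0.41))
weekends_toy <- data.frame(realSum = c(91.6, NA),     dist = c(4.37, 2.19))

# Count missing values per column in each data frame.
na_counts <- sapply(
  list(weekdays = weekdays_toy, weekends = weekends_toy),
  function(df) colSums(is.na(df))
)
na_counts

# TRUE only if neither data frame contains any NA.
no_missing <- all(na_counts == 0)
no_missing
```

The per-column counts make it easier to see where any missing values sit than a single total would.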


Now, I will add a column indicating the type of day each observation is. It has 2 levels: Weekdays and Weekends. Then, the two datasets will be merged into one dataset called athens.

In [3]:
# add indicator columns to both
athens_weekdays <- athens_weekdays %>% mutate(day_type = as.factor("Weekdays"))
athens_weekends <- athens_weekends %>% mutate(day_type = as.factor("Weekends"))

# merge the two datasets into one
athens <- rbind(athens_weekdays, athens_weekends)
head(athens)
A data.frame: 6 × 21
| | X | realSum | room_type | room_shared | room_private | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | ⋯ | bedrooms | dist | metro_dist | attr_index | attr_index_norm | rest_index | rest_index_norm | lng | lat | day_type |
| | <int> | <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <int> | <int> | <dbl> | ⋯ | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> |
| 1 | 0 | 129.82448 | Entire home/apt | False | False | 4 | False | 0 | 0 | 10 | ⋯ | 2 | 2.8139635 | 0.8818900 | 55.34857 | 2.086871 | 78.77838 | 5.915160 | 23.76600 | 37.98300 | Weekdays |
| 2 | 1 | 138.96375 | Entire home/apt | False | False | 4 | True | 1 | 0 | 10 | ⋯ | 1 | 0.4072929 | 0.3045679 | 240.30665 | 9.060559 | 407.16770 | 30.572629 | 23.73168 | 37.97776 | Weekdays |
| 3 | 2 | 156.30492 | Entire home/apt | False | False | 3 | True | 0 | 1 | 10 | ⋯ | 1 | 1.2372111 | 0.2884881 | 199.50737 | 7.522257 | 395.96740 | 29.731642 | 23.72200 | 37.97900 | Weekdays |
| 4 | 3 | 91.62702 | Entire home/apt | False | False | 4 | True | 1 | 0 | 10 | ⋯ | 1 | 4.3674572 | 0.2974673 | 39.80305 | 1.500740 | 58.70658 | 4.408047 | 23.72712 | 38.01435 | Weekdays |
| 5 | 4 | 74.05151 | Private room | False | True | 2 | False | 0 | 0 | 10 | ⋯ | 1 | 2.1941850 | 0.3852657 | 78.73340 | 2.968577 | 113.32597 | 8.509204 | 23.73391 | 37.99529 | Weekdays |
| 6 | 5 | 113.88934 | Entire home/apt | False | False | 6 | True | 1 | 0 | 10 | ⋯ | 2 | 2.0712056 | 0.4538674 | 96.58899 | 3.641806 | 158.64432 | 11.911981 | 23.71584 | 37.98598 | Weekdays |
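A quick sanity check after the merge is to confirm that no rows were lost and that day_type carries both levels. The sketch below does this on two made-up data frames (`weekdays_toy`, `weekends_toy`); on the real data the merged row count would be expected to equal 2653 + 2627 = 5280.

```r
# Hypothetical stand-ins for the two data frames (values are made up).
weekdays_toy <- data.frame(realSum = c(129.8, 139.0, 156.3),
                           day_type = as.factor("Weekdays"))
weekends_toy <- data.frame(realSum = c(91.6, 74.1),
                           day_type = as.factor("Weekends"))

merged_toy <- rbind(weekdays_toy, weekends_toy)

# The merged frame should keep every row and carry both factor levels.
nrow(merged_toy) == nrow(weekdays_toy) + nrow(weekends_toy)
levels(merged_toy$day_type)
table(merged_toy$day_type)
```

Because rbind() on data frames expands factor levels as needed, the first level ("Weekdays") ends up as the reference category when day_type is later used in a model.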
In [4]:
summary(athens$realSum)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   42.88    98.66   127.72   151.74   171.54 18545.45 


Note that realSum appears to contain extreme outliers (the maximum is 18545.45 EUR, while the third quartile is 171.54 EUR). They make it harder to see the pattern in the majority of observations, so I will filter them out for the visualizations.

Within each combination of room_type and day_type, only the values inside the whiskers, [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], are kept; these are the non-outlier observations among the original values of realSum.
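As a rough worked example, the whisker bounds can be computed from the pooled quartiles printed by summary(athens$realSum) above. The filtering cell that follows applies the same rule separately within each room_type and day_type group, so its actual cut-offs will differ from these pooled values.

```r
# Pooled quartiles copied from the summary() output above.
q1 <- 98.66
q3 <- 171.54

iqr   <- q3 - q1           # interquartile range
lower <- q1 - 1.5 * iqr    # lower whisker bound
upper <- q3 + 1.5 * iqr    # upper whisker bound

c(iqr = iqr, lower = lower, upper = upper)
```

With these pooled numbers the IQR is 72.88, the lower bound is about -10.66 and the upper bound is about 280.86 EUR. Since the minimum price is 42.88 EUR, only unusually high prices would be flagged under the pooled rule.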

In [5]:
# filter the data
realSum_within_range <- athens %>%
    group_by(room_type, day_type) %>%
    filter( (realSum >= quantile(realSum,0.25)-1.5*IQR(realSum)) & (realSum <= quantile(realSum,0.75)+1.5*IQR(realSum)) ) %>%
    ungroup() %>%
    select(realSum, day_type, room_type, dist) # these 4 variables will be used for the visualization(s)

head(realSum_within_range)
A tibble: 6 × 4
| realSum | day_type | room_type | dist |
| <dbl> | <fct> | <chr> | <dbl> |
| 129.82448 | Weekdays | Entire home/apt | 2.8139635 |
| 138.96375 | Weekdays | Entire home/apt | 0.4072929 |
| 156.30492 | Weekdays | Entire home/apt | 1.2372111 |
| 91.62702 | Weekdays | Entire home/apt | 4.3674572 |
| 74.05151 | Weekdays | Private room | 2.1941850 |
| 113.88934 | Weekdays | Entire home/apt | 2.0712056 |

2. Visualization¶

The cell below uses both the filtered and unfiltered datasets to generate boxplots faceted by room_type. day_type is mapped to both the x-axis and the fill colour, realSum is mapped to the y-axis, and the individual observations are overlaid as jittered points.

In [6]:
# boxplot with original realSum values
box_price_by_room_original <- athens %>%
    ggplot(aes(x = day_type, y = realSum, fill = day_type)) +
    geom_boxplot(fatten = 4) + # adjust the width of the median bar
    geom_jitter(color="gray", size=0.4, alpha=0.6) + # adding individual observations
    facet_grid(~room_type) + # facet by room_type
    ggtitle("Unfiltered Airbnb Prices Distribution per Room/Day Type") +
    labs(x = "Day Types", y = "Unfiltered Airbnb Price in Athens", fill = "Day Type")


# boxplot with filtered realSum values
box_price_by_room_filtered <- realSum_within_range %>%
    ggplot(aes(x = day_type, y = realSum, fill = day_type)) +
    geom_boxplot(fatten = 4) + # adjust the width of the median bar
    geom_jitter(color="gray", size=0.4, alpha=0.6) + # adding individual observations
    facet_grid(~room_type) + # facet by room_type
    ggtitle("Filtered Airbnb Prices Distribution per Room/Day Type") +
    labs(x = "Day Types", y = "Filtered Airbnb Price in Athens", fill = "Day Type")

The cell below will use both filtered and unfiltered datasets to generate scatterplots that are faceted by room_type. dist is encoded in the x-channel, realSum is encoded in the y-channel, and room_type is encoded in the color-channel

In [7]:
scatter_price_vs_dist_original <- athens %>%
    ggplot(aes(x = dist, y = realSum, color = room_type))+
    geom_point() + 
    facet_grid(~room_type) + 
    ggtitle("Unfiltered Airbnb Price of Each Room Type Versus Distance from City Center") +
    labs(x = "Distance from City Center", y = "Unfiltered Airbnb Price in Athens", color = "Room Type")


scatter_price_vs_dist_filtered <- realSum_within_range %>%
    ggplot(aes(x = dist, y = realSum, color = room_type))+
    geom_point() + 
    facet_grid(~room_type) + 
    ggtitle("Filtered Airbnb Price of Each Room Type Versus Distance from City Center") +
    labs(x = "Distance from City Center", y = "Filtered Airbnb Price in Athens", color = "Room Type")


The cell below combines these plots into one figure.

In [8]:
# resize the plot for a better view
options(repr.plot.width = 12, repr.plot.height = 10)

# concatenate these plots into one
(box_price_by_room_original + scatter_price_vs_dist_original )/(box_price_by_room_filtered+ scatter_price_vs_dist_filtered)
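[Figure: 2 × 2 grid of plots. Top row: boxplots and scatterplots using unfiltered realSum; bottom row: the same plots using filtered realSum. The boxplots show realSum by day_type and the scatterplots show realSum versus dist; all panels are faceted by room_type.]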

3. Interpretations¶

Explain why you consider this plot relevant to address your question or to explore the data.¶

  • The boxplot(s) show the price distribution by room and day type, while the scatterplot(s) explore how location is associated with price for each room category.
  • Removing extreme outliers is important because they distort the scale and hide overall pricing patterns.
  • With filtered data, the plots more clearly reveal typical price ranges and the relationships between Airbnb prices and the key factors of interest.

Interpret briefly the results obtained.¶

  • For the plots with filtered Airbnb price, the Entire home/apt category has the most listings and the highest prices overall, with a wider spread compared to Private room and Shared room, suggesting greater price variation in full apartments.
  • There is no strong distinction between weekday and weekend prices within each room type, suggesting that day type may not be strongly associated with Airbnb pricing in Athens.
  • Most Airbnb listings are close to the city center, where a wide range of prices exists, implying that location alone may not fully explain price differences among listings.

What do you learn from your visualization?¶

  • Isolating the effect of room type will be essential in later inference stages, as each room type shows a distinct price distribution. Without doing so, I may encounter issues such as Simpson’s Paradox, which could lead to misleading conclusions when combining groups.
  • Because these visualizations exclude extreme outliers in realSum for clarity, future analyses should consider the impact of those outliers: they may substantially change the estimates or inflate their variability, reducing the reliability of the model.
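One way to gauge the impact of extreme prices is to fit the same model with and without them and compare the estimates. The sketch below does this on made-up data (`toy`, with a single extreme listing added) and a single predictor; it illustrates the idea only and is not the analysis carried out in this report.

```r
# Hypothetical toy data: price falls with distance, plus one extreme listing.
toy <- data.frame(
  dist    = c(0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4),
  realSum = c(180, 170, 150, 145, 130, 120, 110, 100)
)
toy_outlier <- rbind(toy, data.frame(dist = 0.4, realSum = 18000))

# Fit the same simple model with and without the extreme listing.
fit_without <- lm(realSum ~ dist, data = toy)
fit_with    <- lm(realSum ~ dist, data = toy_outlier)

# Compare slope estimates and their standard errors side by side.
rbind(
  without_outlier = summary(fit_without)$coefficients["dist", 1:2],
  with_outlier    = summary(fit_with)$coefficients["dist", 1:2]
)
```

In this toy example a single extreme listing near the center makes the slope far steeper and its standard error far larger, which is the kind of sensitivity the bullet above warns about.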

Section 4: Method and Plan¶

1. Method Proposal¶

  • Model Selection: A multiple linear regression model is suitable for addressing the proposed question.
  • This model is appropriate because it allows us to quantify how the expected Airbnb price changes with day type, room type, guest satisfaction, and distance from the city center.
  • Because the response, the Airbnb price, is a continuous quantity, logistic regression (which models the log-odds of a binary response) and Poisson regression (which models the log of a mean count) are not appropriate.
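Written out, the proposed model could take the following form. The indicator terms assume R's default treatment coding with Weekdays and Entire home/apt as the baseline levels; the symbols \beta_j and \varepsilon_i are notation introduced here for illustration.

```latex
\text{realSum}_i = \beta_0
  + \beta_1\,\mathbb{1}[\text{day\_type}_i = \text{Weekends}]
  + \beta_2\,\mathbb{1}[\text{room\_type}_i = \text{Private room}]
  + \beta_3\,\mathbb{1}[\text{room\_type}_i = \text{Shared room}]
  + \beta_4\,\text{guest\_satisfaction\_overall}_i
  + \beta_5\,\text{dist}_i
  + \varepsilon_i
```

Each coefficient is then read as the expected change in price associated with its term, holding the other terms fixed.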

2. Method Assumptions¶

  • The true mean price is a linear function of the predictors.
  • The error terms are independent and normally distributed with equal variance.
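These assumptions are commonly examined with residual plots. The sketch below shows one way to do so on simulated data (`toy`, generated so that the assumptions hold by construction); the variable names mirror the report, but the numbers are made up.

```r
# Hypothetical toy data standing in for the Athens listings (values are simulated).
set.seed(301)
toy <- data.frame(dist = runif(200, 0, 6))
toy$realSum <- 180 - 20 * toy$dist + rnorm(200, sd = 15)

fit <- lm(realSum ~ dist, data = toy)

pdf(NULL)  # null graphics device so the sketch also runs non-interactively
# Linearity and equal variance: residuals should scatter evenly around zero.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Normality: points should fall close to the reference line.
qqnorm(resid(fit))
qqline(resid(fit))
invisible(dev.off())
```

In a notebook, the pdf(NULL) and dev.off() lines can be dropped so that the plots render inline. A funnel shape in the first plot or heavy tails in the second would suggest that the equal-variance or normality assumption is doubtful.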

3. Method Limitations¶

  • A multiple linear regression can produce a negative expected response, which is invalid for Airbnb prices.
  • The above assumptions may not hold. For example, the relationship between the response and the covariates may not be linear, in which case a multiple linear regression model would not describe the data correctly.
  • Observational data prevent any causal conclusions, so all findings should be interpreted as associations.
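To make the first limitation concrete, the coefficient estimates reported in Section 5 can be plugged in for a hypothetical listing. The listing's characteristics below (a weekday shared room, 5 km from the city center, with a satisfaction rating of 80) are made up for illustration.

```r
# Coefficient estimates copied from the fitted model output in Section 5.
beta <- c(intercept = 116.299, weekends = -6.950, private_room = -39.393,
          shared_room = -78.057, satisfaction = 1.032, dist = -31.079)

# Hypothetical listing: weekday, shared room, satisfaction 80, 5 km from the center.
x <- c(intercept = 1, weekends = 0, private_room = 0,
       shared_room = 1, satisfaction = 80, dist = 5)

predicted_price <- sum(beta * x)
predicted_price  # negative, which is not a valid price
```

The linear predictor gives about -34.59 EUR for this listing, showing that the fitted model can return negative expected prices for plausible-looking inputs.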

Section 5: Computational Code and Output¶

1. Computational Code¶

In [23]:
price_mlr <- lm(realSum ~ day_type + room_type + guest_satisfaction_overall + dist, data = athens)

price_mlr
Call:
lm(formula = realSum ~ day_type + room_type + guest_satisfaction_overall + 
    dist, data = athens)

Coefficients:
               (Intercept)            day_typeWeekends  
                   116.299                      -6.950  
     room_typePrivate room        room_typeShared room  
                   -39.393                     -78.057  
guest_satisfaction_overall                        dist  
                     1.032                     -31.079  

2. Summary Table¶

In [24]:
price_mlr_result <-
    price_mlr %>%
    broom::tidy(conf.int = TRUE) %>%
    mutate_if(is.numeric, round, 3)

price_mlr_result
A tibble: 6 × 7
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | 116.299 | 42.528 | 2.735 | 0.006 | 32.926 | 199.671 |
| day_typeWeekends | -6.950 | 7.266 | -0.957 | 0.339 | -21.194 | 7.294 |
| room_typePrivate room | -39.393 | 13.802 | -2.854 | 0.004 | -66.451 | -12.334 |
| room_typeShared room | -78.057 | 79.674 | -0.980 | 0.327 | -234.252 | 78.138 |
| guest_satisfaction_overall | 1.032 | 0.436 | 2.367 | 0.018 | 0.177 | 1.887 |
| dist | -31.079 | 3.810 | -8.158 | 0.000 | -38.548 | -23.610 |
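The columns of the summary table are linked by simple formulas, which can be checked by hand for one row. The sketch below recomputes the test statistic, p-value, and 95% confidence interval for dist from its rounded estimate and standard error. It assumes all 5280 observations were used in the fit, so the residual degrees of freedom are 5280 - 6; small differences from the table come from rounding.

```r
# Values copied from the dist row of the summary table above.
estimate  <- -31.079
std_error <- 3.810

# Residual degrees of freedom: 5280 observations minus 6 estimated coefficients
# (assumes no rows were dropped when fitting).
df_resid <- 5280 - 6

t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)
ci_95   <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

round(c(statistic = t_stat, conf.low = ci_95[1], conf.high = ci_95[2]), 3)
p_value
```

The p-value for dist is shown as 0.000 in the table only because it has been rounded to three decimal places; it is very small but not exactly zero.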

3. Interpretations¶

day_type¶

  • At the 5% significance level, there is insufficient evidence that prices on weekends differ from prices on weekdays, holding the other variables constant (p = 0.339).

room_type¶

  • Compared to entire homes/apartments (the baseline category), private rooms are expected to be priced 39.393 EUR lower on average, holding the other variables constant. At the 5% significance level, there is statistically significant evidence that private rooms differ in price from entire homes (p = 0.004), while there is insufficient evidence that shared rooms differ in price from entire homes (p = 0.327).

guest_satisfaction_overall¶

  • Holding day type, room type, and dist constant, each additional point in the overall guest satisfaction rating is associated with an expected increase of about 1.032 EUR in the total price of the listing. At the 5% significance level, there is statistical evidence that overall guest satisfaction is positively associated with the total price of the listing (p = 0.018).

dist¶

  • Holding day type, room type, and guest satisfaction constant, each additional kilometre between the listing and the city center is associated with an expected decrease of about 31.079 EUR in the total price of the listing, and at 5% significance level, there is statistical evidence that distance from the city center is negatively associated with the total price of the listing.